Team 4: Sam Cole, Alicia Hauglie, & Sai Gugulothu
Our data came from MLB.com, which has MLB player stats dating back to 1876. We chose to focus on active players, meaning those currently playing in the league. The dataset contains hitting stats for all active players with at least 3.1 plate appearances per team game played. These are the players who start almost every game and get the most at-bats, so they are essentially the regular starters on each of the 30 MLB teams. The data include player positions, batting averages, RBIs, and home runs, along with many other stats for each player.
The top 1000 rankings are included in our dataset, and ranking is determined solely by AVG, or batting average. Each row is the stats for one player in one year, so players who have played multiple years have multiple rows. There are 1001 rows rather than 1000 because two players share the same rank at one point. MLB consists of the American League (AL) and the National League (NL). The only major differences between the two are which teams belong to each and that the AL has a “designated hitter” position that the NL does not. Some definitions are needed to better understand our data and the game of baseball in general.
RK: rank. The rank of the player (in a given year) based on their batting average.
Player: Name of the player.
Year: Year in which the player’s stats are from.
Team: The abbreviation of the 30 MLB team names. In the Stadiums file, “Team Name” shows the full name of the teams.
Pos: Player’s position.
- 1B: first base
- SS: shortstop
- 2B: second base
- 3B: third base
- CF: center field
- RF: right field
- LF: left field
- C: catcher
- DH: designated hitter (only in the AL)
- OF: outfielder
G: number of games in which the player appeared (in a given year).
AB: number of official at bats by a batter. (This is plate appearances minus sacrifices, walks, and “hit by pitches”.)
R: runs. The number of times a baserunner safely reaches home plate.
H: hits. The number of times a batter hits the ball and reaches a base safely (without the aid of an error).
2B: doubles. The number of times a batter hits the ball and safely reaches second base on the hit.
3B: triples. The number of times a batter hits the ball and safely reaches third base on the hit.
HR: home runs. The number of times a batter hits a home run.
RBI: runs batted in. The number of runs that come from a batter hitting the ball. (If bases are loaded and batter hits a HR, RBI is 4)
BB: walks (bases on balls). The number of times a batter is awarded first base after four balls in a plate appearance.
SO: strikeouts. Three strikes during an at bat.
SB: stolen base. Number of times a player has stolen a base.
CS: caught stealing. Number of times a player has gotten out while trying to steal a base.
AVG: batting average. Hits divided by at bats; roughly, the chance a player gets a hit during an at bat.
OBP: on-base percentage. How frequently a player reaches base per plate appearance.
SLG: slugging percentage. Total bases divided by at bats. Unlike batting average, it weights singles, doubles, triples, and HRs by the number of bases earned, so a higher SLG means a player is more “productive” when hitting.
OPS: on-base plus slugging. The sum of OBP and SLG, which measures a player’s ability to get on base AND hit for power.
SF: sacrifice flies. The number of times a runner tags up and scores after a batter’s fly out.
AO: fly outs. Total number of times a batter hit the ball and it was caught in the air, resulting in an out.
GO: ground outs. Number of times a batter has gotten out on a ground ball.
PA: plate appearances.
NP: number of pitches thrown during all of the batter’s plate appearances.
RBIAB: runs batted in per at bat.
HRAB: home runs per at bat.
BABIP: batting average on balls in play. When a player makes contact with the ball, what’s the chance they’ll get a hit? This does not account for strikeouts (because the ball is not put into play).
NPPA: number of pitches per plate appearance.
NPAB: number of pitches per at bat.
SOAB: number of strikeouts per at bat.
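For reference, the rate stats defined above follow the standard baseball formulas sketched below. This is a summary of the conventional definitions, not necessarily the exact formulas used to build the derived columns in our data. HBP (hit by pitch) and TB (total bases) are not columns in this dataset and appear only to state the formulas; here 2B and 3B mean doubles and triples, not the fielding positions.

```latex
\begin{align*}
\mathrm{AVG}   &= \frac{H}{AB} \\
\mathrm{OBP}   &= \frac{H + BB + HBP}{AB + BB + HBP + SF} \\
\mathrm{TB}    &= H + 2B + 2\,(3B) + 3\,HR \\
\mathrm{SLG}   &= \frac{\mathrm{TB}}{AB} \\
\mathrm{OPS}   &= \mathrm{OBP} + \mathrm{SLG} \\
\mathrm{BABIP} &= \frac{H - HR}{AB - SO - HR + SF}
\end{align*}
```

For example, a player with 150 hits in 500 at bats has an AVG of 150/500 = .300.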
What do the batting average and OPS look like for all active players?
Which position has the best batting average?
What stats have strong correlations to one another?
Does getting more pitches in an at bat increase the odds of hitting a home run?
What teams have the best batting averages?
How did switching teams affect Albert Pujols’s stats?
Why is Mike Trout considered such a well-rounded player (possibly the greatest of all time)?
Do home run hitters have higher strikeout percentages?
What makes the Yankees and Red Sox such a good rivalry?
Does the average number of pitches in an at-bat correlate with batting average? Does it correlate with HRs?
Where does Buster Posey fall in terms of average BABIP?
Which teams hit the most home runs?
Where are the stadiums of the 30 MLB teams?
Originally, we spent a large amount of time trying to scrape the data from the website using SelectorGadget. After running into issues with expanding tables on the page, we decided to simply create an Excel spreadsheet of the data, which we then imported as a dataframe named “dat”. To clean the data, we removed the * before player names, deleted the columns Player2 and Player3 (which just repeated the same info as Player), and converted a few variables from character to numeric. To answer some of our questions, we created a few variables by mutating existing columns: RBIAB, HRAB, BABIP, NP, NPPA, NPAB, and SOAB.
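The cleaning and mutation steps can be illustrated with a minimal Python sketch. It is not the code we ran: the helper names (`clean_row`, `ratio`, `add_derived`), the sample row, and the choice of numeric columns are hypothetical. NP is listed among the created variables, but since its derivation is not described here, the sketch simply takes NP as an input column. BABIP uses the conventional formula, which is an assumption about how the column was built.

```python
# Columns converted from text to numbers (an assumed subset of the real sheet).
NUMERIC_COLS = ("AB", "PA", "H", "HR", "RBI", "SO", "SF", "NP")


def clean_row(raw):
    """Return a cleaned copy of one player-season row."""
    # Drop the redundant name columns.
    row = {k: v for k, v in raw.items() if k not in ("Player2", "Player3")}
    # Remove the leading * that precedes some player names.
    row["Player"] = row["Player"].lstrip("*").strip()
    # Spreadsheet cells arrive as text; make the stat columns numeric.
    for col in NUMERIC_COLS:
        row[col] = float(row[col])
    return row


def ratio(numerator, denominator):
    """Divide safely; return None when the denominator is zero."""
    return numerator / denominator if denominator else None


def add_derived(row):
    """Add the per-at-bat and per-plate-appearance variables to a row."""
    out = dict(row)
    out["RBIAB"] = ratio(row["RBI"], row["AB"])  # runs batted in per at bat
    out["HRAB"] = ratio(row["HR"], row["AB"])    # home runs per at bat
    out["SOAB"] = ratio(row["SO"], row["AB"])    # strikeouts per at bat
    out["NPPA"] = ratio(row["NP"], row["PA"])    # pitches per plate appearance
    out["NPAB"] = ratio(row["NP"], row["AB"])    # pitches per at bat
    # Conventional BABIP: hits on balls in play over balls in play (assumed).
    out["BABIP"] = ratio(
        row["H"] - row["HR"],
        row["AB"] - row["SO"] - row["HR"] + row["SF"],
    )
    return out


# Hypothetical player-season, made up purely for illustration.
raw_row = {
    "Player": "*Example Player", "Player2": "Example Player",
    "Player3": "Example Player", "AB": "500", "PA": "560", "H": "150",
    "HR": "25", "RBI": "100", "SO": "100", "SF": "5", "NP": "2240",
}
cleaned = clean_row(raw_row)
derived = add_derived(cleaned)
print(derived["Player"], derived["HRAB"], derived["NPPA"])
```

Returning `None` for a zero denominator keeps a row with no at bats from raising an error; with the 3.1 plate appearance cutoff this should not occur in practice.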
We began by getting some simple visualizations of the basic stats in our data:
What do the batting average and OPS look like for all active players?
This shows us the batting averages for all active players, with each data point being a certain player in a certain year. An AVG of .300 (sometimes written as 300) means the player gets a hit in 30% of at bats. For reference, .350 is a very good batting average. The outliers are players who played just a few games in a season and happened to play really well during those games, making them very high outliers. AVG alone does not tell us how good a player is, because different players have different goals: some hit to get on base, some hit for HRs, and so on. The distribution would be bell-shaped if worse-ranked players (those with AVGs below .250) were also included. Since our dataset only includes players with at least 3.1 plate appearances per team game, it generally would not include players with AVGs below .250: they would get pulled for not hitting well and therefore would not reach the 3.1 plate appearance threshold.
Both batting average and OPS are bell-shaped for active players.
Which position has the best batting average?

| Pos | MEANAVG |
|---|---|
| 1B | 0.293 |
| SS | 0.284 |
| 2B | 0.289 |
| 3B | 0.285 |
| CF | 0.286 |
| RF | 0.284 |
| LF | 0.290 |
| C | 0.288 |
| DH | 0.275 |
| OF | 0.288 |
We found that designated hitters actually have the worst batting averages. This could be because we have less data for DHs, since the position exists only in the American League. If we had data for all MLB players, not just those with 3.1+ plate appearances per team game, we think we would see a different story: the averages for all positions would be much closer to one another, and DH would be among the positions with the higher batting averages. We think our data may show DHs as the worst hitters because most DHs are older players who are put in this position because they are no longer very good at fielding. Generally, the best players both bat and play the field.
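The per-position averages in the table amount to a group-by-and-mean over player-seasons. Below is a minimal Python sketch of that idea, not the code we ran; the function name `mean_avg_by_position` and the three sample rows are hypothetical. It also returns the row count per position, since the number of DH rows matters for the "less data" explanation above.

```python
from collections import defaultdict


def mean_avg_by_position(rows):
    """Map each position to (mean AVG rounded to 3 places, number of rows)."""
    totals = defaultdict(lambda: [0.0, 0])  # position -> [sum of AVG, count]
    for row in rows:
        bucket = totals[row["Pos"]]
        bucket[0] += row["AVG"]
        bucket[1] += 1
    return {
        pos: (round(total / count, 3), count)
        for pos, (total, count) in totals.items()
    }


# Hypothetical player-seasons, made up purely for illustration.
sample_rows = [
    {"Pos": "1B", "AVG": 0.300},
    {"Pos": "1B", "AVG": 0.280},
    {"Pos": "DH", "AVG": 0.270},
]
by_position = mean_avg_by_position(sample_rows)
for pos, (mean_avg, count) in sorted(by_position.items()):
    print(pos, mean_avg, count)
```

A position with only a handful of rows has a less stable mean, so reading the count next to MEANAVG helps judge how much weight the DH figure can bear.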
What stats have strong correlations to one another?
To find what stats have strong correlations to one another we created a correlation matrix using several of the variables in our data and produced a heatmap to better visualize the relationships.
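As a rough illustration of what a correlation matrix holds, here is a minimal Python sketch that computes pairwise Pearson correlations. It is not the code we ran: the function names, the chosen columns, and the four sample rows are hypothetical, and the heatmap drawing step is left out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(rows, cols):
    """Nested dict so that matrix[a][b] is the correlation of columns a and b."""
    series = {col: [row[col] for row in rows] for col in cols}
    return {
        a: {b: pearson(series[a], series[b]) for b in cols}
        for a in cols
    }


# Hypothetical player-seasons, made up purely for illustration.
sample_rows = [
    {"HR": 10, "RBI": 40, "SB": 30},
    {"HR": 20, "RBI": 60, "SB": 20},
    {"HR": 30, "RBI": 80, "SB": 10},
    {"HR": 40, "RBI": 100, "SB": 5},
]
matrix = correlation_matrix(sample_rows, ["HR", "RBI", "SB"])
for name, values in matrix.items():
    print(name, {other: round(r, 2) for other, r in values.items()})
```

Each cell ranges from -1 to 1, and a heatmap simply colors these cells so that strong positive and negative relationships stand out.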